Project 2: Classification Models¶

Classification is the process of predicting the class of given data points. Classes are sometimes called targets, labels, or categories. Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a discrete output variable (y).

For example, spam detection in email service providers can be framed as a classification problem. This is a binary classification problem, since there are only two classes: spam and not spam. A classifier uses training data to learn how the input variables relate to the class; in this case, known spam and non-spam emails serve as the training data. Once the classifier is trained accurately, it can be used to classify an unknown email.

Classification belongs to the category of supervised learning, where the targets are provided along with the input data. Classification has applications in many domains, such as credit approval, medical diagnosis, and target marketing.

Load Libraries¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

Import Data¶

In [2]:
drug_df = pd.read_csv("drug200.csv")
In [3]:
drug_df.head()
Out[3]:
Age Sex BP Cholesterol Na_to_K Drug
0 23 F HIGH HIGH 25.355 DrugY
1 47 M LOW HIGH 13.093 drugC
2 47 M LOW HIGH 10.114 drugC
3 28 F NORMAL HIGH 7.798 drugX
4 61 F LOW HIGH 18.043 DrugY
In [4]:
drug_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB

We can see that there are no missing/null values in the data set.

In [5]:
drug_df.Sex.value_counts()
Out[5]:
M    104
F     96
Name: Sex, dtype: int64
In [6]:
drug_df.BP.value_counts()
Out[6]:
HIGH      77
LOW       64
NORMAL    59
Name: BP, dtype: int64
In [7]:
drug_df.Cholesterol.value_counts()
Out[7]:
HIGH      103
NORMAL     97
Name: Cholesterol, dtype: int64
In [8]:
drug_df.describe()
Out[8]:
Age Na_to_K
count 200.000000 200.000000
mean 44.315000 16.084485
std 16.544315 7.223956
min 15.000000 6.269000
25% 31.000000 10.445500
50% 45.000000 13.936500
75% 58.000000 19.380000
max 74.000000 38.247000
In [9]:
print('Age skewness: ', drug_df.Age.skew(axis=0, skipna=True))
Age skewness:  0.03030835703000607
In [10]:
print('Na_to_K skewness: ', drug_df.Na_to_K.skew(axis=0, skipna=True))
Na_to_K skewness:  1.039341186028881
In [11]:
sns.displot(drug_df['Age'], kde=True)
Out[11]:
<seaborn.axisgrid.FacetGrid at 0x7fc2d4c05e20>
In [12]:
sns.displot(drug_df['Na_to_K'], kde=True)
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x7fc2d4f23340>

Exploratory Data Analysis¶

In [13]:
fig = px.histogram(drug_df, x='Drug',color='Sex', height=500, width=900)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Drug Types by Gender')
fig

Drug X is taken by equal numbers of males and females. Drug Y is mostly taken by females. Drug A, Drug B, and Drug C are mostly taken by male patients.

In [14]:
fig = px.histogram(drug_df, x='Age',color='Cholesterol', height=500, width=900)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Age by Cholesterol Level')
fig
  • More patients in the 35-39 age group have normal cholesterol levels than any other age group.

  • More patients in the 55-59 age group have high cholesterol levels than any other age group.

In [15]:
fig = px.bar(drug_df, x='Age',y='Na_to_K', height=500, width=900)
fig.update_layout(
    template="seaborn",barmode='stack', xaxis={'categoryorder':'total descending'},
    title='NA/K Levels by Age')
fig
In [16]:
fig = px.histogram(drug_df, x='Age',color='BP', height=500, width=900)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Blood Pressure Levels by Age')
fig

Most patients in the 45-49 age group have low blood pressure, while most patients in the 20-29 age group have either high or normal blood pressure.

In [17]:
fig = px.histogram(drug_df, x='Age', color='Sex', height=500, width=900)
fig.update_layout(
    template='seaborn', barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Age by Gender')
fig

Most males are in the 45-49 age group. And most females are in the 35-39 and 55-59 age groups.

In [18]:
fig = px.scatter(drug_df, x='Drug',y='Age', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title="Distribution of Patient Ages by Drug Type")
fig

All patients taking Drug B are over the age of 50, and all patients taking Drug A are under 51.

In [19]:
fig = px.histogram(drug_df, x='BP',color='Sex', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Blood Pressure Levels by Gender')
fig

Most patients have high blood pressure.

In [20]:
fig = px.histogram(drug_df, x='Sex',color='Cholesterol', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Gender by Cholesterol Level')
fig

More males than females have high and normal cholesterol.

In [21]:
fig = px.scatter(drug_df, x='Sex',y='Na_to_K', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Na/K Levels by Gender')
fig
  • Females have Na/K ratios between 6.6 and 38.
  • Males have Na/K ratios between 6.2 and 35.
In [22]:
fig = px.histogram(drug_df, x='BP',color='Cholesterol', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Blood Pressure levels by Cholesterol Levels')
fig
  • 17.5% of patients have high blood pressure and high cholesterol.
  • 21% of patients have high blood pressure and normal cholesterol.
  • 15% of patients have low blood pressure and high cholesterol.
  • 6.5% of patients have low blood pressure and normal cholesterol.
  • 18.5% of patients have normal blood pressure and high cholesterol.
In [23]:
drug_df.loc[(drug_df['BP']=='HIGH') & (drug_df['Cholesterol']=='HIGH'),
            'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='High Blood Pressure and High Cholesterol')
Out[23]:
<AxesSubplot:title={'center':'High Blood Pressure and High Cholesterol'}, ylabel='Drug'>

Most patients with high blood pressure and high cholesterol take Drug Y.

In [24]:
drug_df.loc[(drug_df['BP']=='HIGH') & (drug_df['Cholesterol']=='NORMAL'),
            'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='High Blood Pressure and Normal Cholesterol')
Out[24]:
<AxesSubplot:title={'center':'High Blood Pressure and Normal Cholesterol'}, ylabel='Drug'>

Most patients with high blood pressure and normal cholesterol take Drug Y.

In [25]:
drug_df.loc[(drug_df['BP']=='LOW') & (drug_df['Cholesterol']=='HIGH'),
            'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Low Blood Pressure and High Cholesterol')
Out[25]:
<AxesSubplot:title={'center':'Low Blood Pressure and High Cholesterol'}, ylabel='Drug'>
In [26]:
drug_df.loc[(drug_df['BP']=='HIGH') & (drug_df['Cholesterol']=='HIGH'),
            'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='High Blood Pressure and High Cholesterol')
Out[26]:
<AxesSubplot:title={'center':'High Blood Pressure and High Cholesterol'}, ylabel='Drug'>
In [27]:
drug_df.loc[(drug_df['BP']=='NORMAL') & (drug_df['Cholesterol']=='HIGH'),
            'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Normal Blood Pressure and High Cholesterol')
Out[27]:
<AxesSubplot:title={'center':'Normal Blood Pressure and High Cholesterol'}, ylabel='Drug'>
In [28]:
drug_df.loc[(drug_df['BP']=='NORMAL') & (drug_df['Cholesterol']=='NORMAL'),
            'Drug'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Normal Blood Pressure and Normal Cholesterol')
Out[28]:
<AxesSubplot:title={'center':'Normal Blood Pressure and Normal Cholesterol'}, ylabel='Drug'>
In [29]:
fig = px.histogram(drug_df, x='Drug',color='BP', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Drug Type by Blood Pressure Level')
fig
  • Patients with low blood pressure mostly take drugs X and C.
  • Drugs A and B are only taken by patients with high blood pressure.
  • Most patients with normal blood pressure take drug X.
  • Drug Y is taken by patients at every blood pressure level.
In [30]:
fig = px.scatter(drug_df, x='Cholesterol',y='Na_to_K', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Na/K Ratios by Cholesterol Levels')
fig
  • Patients with high cholesterol have Na/K ratios between 6.7 and 38.24.
  • Patients with normal cholesterol have Na/K ratios between 6.2 and 35.63.
In [31]:
fig = px.histogram(drug_df, x='Drug',color='Cholesterol', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='group', xaxis={'categoryorder':'total descending'},
    title='Distribution of Drug Types by Cholesterol Levels')
fig
  • Drug X is taken mostly by patients with normal cholesterol levels.
  • Drug C is only taken by patients with high cholesterol levels.
In [32]:
fig = px.scatter(drug_df, x='Drug',y='Na_to_K', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='overlay', xaxis={'categoryorder':'total descending'},
    title='Distribution of Drug Types by Na/K Ratio')
fig
  • Drug Y is taken by patients with a Na/K ratio over 15.
In [33]:
drug_df.loc[(drug_df['Na_to_K']<15),'BP'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Na to K under 15 effect on BP')
Out[33]:
<AxesSubplot:title={'center':'Na to K under 15 effect on BP'}, ylabel='BP'>
In [34]:
drug_df.loc[(drug_df['Na_to_K']>15),'BP'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Na to K over 15 effect on BP')
Out[34]:
<AxesSubplot:title={'center':'Na to K over 15 effect on BP'}, ylabel='BP'>
  • Patients with a Na/K ratio above 15 are more likely to have high or low blood pressure.
  • The Na/K ratio appears to be related to blood pressure level.

Conclusion¶

  1. The patients are between 15 and 74 years old; the most common age is 47. Most males are in the 45-49 age group, and most females are in the 35-39 and 55-59 age groups.
  2. There are more male patients than female patients.
  3. Most patients have high blood pressure. The largest age group is 45-49, and most of those patients have low blood pressure.
  4. 51.5% of patients have high cholesterol.
  5. Most patients have a sodium-to-potassium ratio between 10 and 12.
  6. Five types of drugs are taken by patients; most patients take Drug Y.
  7. All patients taking Drug B are over the age of 50, and all patients taking Drug A are under 51.
  8. Drug X is taken by equal numbers of males and females. Drug Y is mostly taken by females. Drug A, Drug B, and Drug C are mostly taken by male patients.
  9. Patients with high blood pressure take Drug Y, Drug A, or Drug B.
  10. Drug Y is the most common drug.
  11. Drug C is taken by patients with low blood pressure, while Drug A and Drug B are taken by patients with high blood pressure.
  12. Drug Y is taken by patients with a Na/K ratio higher than 15; the other drugs are taken by patients with a Na/K ratio below 15.
  13. Patients with a Na/K ratio above 15 are more likely to have high or low blood pressure; the Na/K ratio appears to be related to blood pressure level.

Prepare the Data for Modeling¶

Data Binning¶

Age¶
  • Below 20 y.o.
  • 20 - 29 y.o.
  • 30 - 39 y.o.
  • 40 - 49 y.o.
  • 50 - 59 y.o.
  • 60 - 69 y.o.
  • Above 70.
In [35]:
bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']
drug_df['Age_binned'] = pd.cut(drug_df['Age'], bins=bin_age, labels=category_age)
drug_df = drug_df.drop(['Age'], axis = 1)
Na_to_K¶
  • Below 10.
  • 10 - 20.
  • 20 - 30.
  • Above 30.
In [36]:
bin_NatoK = [0, 9, 19, 29, 50]
category_NatoK = ['<10', '10-20', '20-30', '>30']
drug_df['Na_to_K_binned'] = pd.cut(drug_df['Na_to_K'], bins=bin_NatoK, labels=category_NatoK)
drug_df = drug_df.drop(['Na_to_K'], axis = 1)

Split the Data Set into Training and Test Sets¶

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

We need to separate the response variable from the predictor variables.

In [38]:
X = drug_df.drop(["Drug"], axis=1)
y = drug_df["Drug"]

Since this is a small data set, we will hold out 33% of it as a test sample and use the remaining 67% for training.

In [39]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
In [40]:
print('The Shape Of The Original Data: ', drug_df.shape)
print('The Shape Of The Test Sample: ', x_test.shape)
print('The Shape Of The Training Sample: ', x_train.shape)
print('The Shape Of The Test Sample: ', y_test.shape)
print('The Shape Of The Training Sample: ', y_train.shape)
The Shape Of The Original Data:  (200, 6)
The Shape Of The Test Sample:  (66, 5)
The Shape Of The Training Sample:  (134, 5)
The Shape Of The Test Sample:  (66,)
The Shape Of The Training Sample:  (134,)

This confirms that our test sample is 33% of the full data set.

In [41]:
drug_df['Drug'].value_counts()
Out[41]:
DrugY    91
drugX    54
drugA    23
drugC    16
drugB    16
Name: Drug, dtype: int64
In [42]:
drug_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Sex             200 non-null    object  
 1   BP              200 non-null    object  
 2   Cholesterol     200 non-null    object  
 3   Drug            200 non-null    object  
 4   Age_binned      200 non-null    category
 5   Na_to_K_binned  200 non-null    category
dtypes: category(2), object(4)
memory usage: 7.3+ KB

Set the Dummy Variables¶

In [43]:
x_train = pd.get_dummies(x_train)
x_test = pd.get_dummies(x_test)
In [44]:
x_train.head()
Out[44]:
Sex_F Sex_M BP_HIGH BP_LOW BP_NORMAL Cholesterol_HIGH Cholesterol_NORMAL Age_binned_<20s Age_binned_20s Age_binned_30s Age_binned_40s Age_binned_50s Age_binned_60s Age_binned_>60s Na_to_K_binned_<10 Na_to_K_binned_10-20 Na_to_K_binned_20-30 Na_to_K_binned_>30
54 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0
163 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0
51 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0
86 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0
139 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0
In [45]:
x_test.head()
Out[45]:
Sex_F Sex_M BP_HIGH BP_LOW BP_NORMAL Cholesterol_HIGH Cholesterol_NORMAL Age_binned_<20s Age_binned_20s Age_binned_30s Age_binned_40s Age_binned_50s Age_binned_60s Age_binned_>60s Na_to_K_binned_<10 Na_to_K_binned_10-20 Na_to_K_binned_20-30 Na_to_K_binned_>30
18 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0
170 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0
107 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0
98 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1
177 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0
In [46]:
print("X_train", x_train.shape)
print("X_test", x_test.shape)
print("y_train", y_train.shape)
print("y_test", y_test.shape)
X_train (134, 18)
X_test (66, 18)
y_train (134,)
y_test (66,)

Balance the Data Set using the SMOTE Technique¶

SMOTE, or Synthetic Minority Oversampling Technique, is an oversampling technique, but it works differently from typical oversampling.

In classic oversampling, minority-class data points are simply duplicated. While this increases the amount of data, it does not give the machine learning model any new information or variation.

SMOTE instead uses a k-nearest-neighbours algorithm to create synthetic data: it chooses a random data point from the minority class, finds that point's k nearest minority-class neighbours, and then generates a synthetic point between the chosen point and one randomly selected neighbour.

https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
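The interpolation step can be sketched in a few lines. This is a toy illustration of the idea, not the imblearn implementation; the 2-D minority points and the choice of k are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D minority-class points
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def smote_point(X, k=2, rng=rng):
    """Create one synthetic sample: pick a random minority point,
    pick one of its k nearest minority neighbours, and interpolate
    a random fraction of the way between them."""
    i = rng.integers(len(X))
    d = np.linalg.norm(X - X[i], axis=1)     # distances to all minority points
    neighbours = np.argsort(d)[1:k + 1]      # k nearest, skipping the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                       # interpolation fraction in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = smote_point(minority)
print(synthetic)
```

Because the synthetic point lies on the segment between two existing minority points, it always falls inside the region the minority class already occupies.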

In [47]:
#pip install imblearn
In [48]:
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")
plt.ylabel('Drug Type')
plt.xlabel('Total')
plt.show()

This shows us that the training set is not balanced.

In [49]:
from imblearn.over_sampling import SMOTE
x_train, y_train = SMOTE().fit_resample(x_train, y_train)
In [50]:
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")
plt.ylabel('Drug Type')
plt.xlabel('Total')
plt.show()

This shows us that the distribution of drug types in the training set is now balanced.

Models¶

Logistic Regression¶

This type of statistical model (also known as logit model) is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset of independent variables. Since the outcome is a probability, the dependent variable is bounded between 0 and 1. In logistic regression, a logit transformation is applied on the odds—that is, the probability of success divided by the probability of failure. This is also commonly known as the log odds, or the natural logarithm of odds.

https://www.ibm.com/topics/logistic-regression
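The logit transformation described above can be illustrated in a few lines (a standalone sketch with a made-up probability, separate from the model fit below):

```python
import numpy as np

def sigmoid(z):
    """Map log-odds (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """The inverse: log-odds = ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

p = 0.8                    # hypothetical probability of "success"
odds = p / (1 - p)         # 4.0: success is four times as likely as failure
z = logit(p)               # natural log of the odds
assert abs(sigmoid(z) - p) < 1e-12   # the sigmoid maps the log-odds back to p
print(odds, z)
```

Because the sigmoid bounds its output between 0 and 1, the model's predictions can always be read as probabilities.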

In [51]:
from sklearn.linear_model import LogisticRegression

LRclassifier = LogisticRegression(solver='liblinear', max_iter=5000)
LRclassifier.fit(x_train, y_train)

y_pred = LRclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

from sklearn.metrics import accuracy_score

LRAcc = accuracy_score(y_pred,y_test)
print('Logistic Regression accuracy is: {:.2f}%'.format(LRAcc*100))
              precision    recall  f1-score   support

       DrugY       1.00      0.74      0.85        34
       drugA       0.71      1.00      0.83         5
       drugB       0.75      1.00      0.86         3
       drugC       0.67      1.00      0.80         4
       drugX       0.83      1.00      0.91        20

    accuracy                           0.86        66
   macro avg       0.79      0.95      0.85        66
weighted avg       0.90      0.86      0.86        66

[[25  2  1  2  4]
 [ 0  5  0  0  0]
 [ 0  0  3  0  0]
 [ 0  0  0  4  0]
 [ 0  0  0  0 20]]
Logistic Regression accuracy is: 86.36%

K-Nearest Neighbors¶

K-Nearest Neighbors is a lazy learning algorithm which stores all instances corresponding to the training data points in n-dimensional space. When an unknown discrete-valued query is received, it examines the k closest stored instances (the nearest neighbors) and returns the most common class among them as the prediction; for real-valued data it returns the mean of the k nearest neighbors.

The distance-weighted variant weights the contribution of each of the k neighbors according to its distance from the query point, giving greater weight to the closest neighbors.

KNN is usually robust to noisy data since it averages over the k nearest neighbors.

https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
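The distance-weighted vote can be sketched as follows. The 2-D points and the inverse-distance weight are assumptions for illustration (similar in spirit to scikit-learn's `weights='distance'` option, not a copy of its internals):

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X, y, query, k=3):
    """Classify `query` by an inverse-distance-weighted vote
    among its k nearest training points."""
    d = np.linalg.norm(X - query, axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y[i]] += 1.0 / (d[i] + 1e-9)   # closer neighbours count more
    return max(votes, key=votes.get)

# Hypothetical 2-D training points with two classes
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array(["a", "a", "a", "b", "b", "b"])

print(weighted_knn_predict(X, y, np.array([0.2, 0.2])))  # → a
print(weighted_knn_predict(X, y, np.array([5.2, 5.2])))  # → b
```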

In [52]:
from sklearn.neighbors import KNeighborsClassifier

KNclassifier = KNeighborsClassifier(n_neighbors=20)
KNclassifier.fit(x_train, y_train)

y_pred = KNclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

KNAcc = accuracy_score(y_pred,y_test)
print('K Neighbours accuracy is: {:.2f}%'.format(KNAcc*100))
              precision    recall  f1-score   support

       DrugY       0.92      0.68      0.78        34
       drugA       0.50      0.80      0.62         5
       drugB       0.67      0.67      0.67         3
       drugC       0.57      1.00      0.73         4
       drugX       0.78      0.90      0.84        20

    accuracy                           0.77        66
   macro avg       0.69      0.81      0.73        66
weighted avg       0.81      0.77      0.78        66

[[23  3  0  3  5]
 [ 0  4  1  0  0]
 [ 0  1  2  0  0]
 [ 0  0  0  4  0]
 [ 2  0  0  0 18]]
K Neighbours accuracy is: 77.27%

Support Vector Machine (SVM)¶

The Support Vector Machine, abbreviated SVM, can be used for both regression and classification tasks, but it is widely used for classification. The objective of the algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.

To separate two classes of data points, there are many possible hyperplanes that could be chosen. The objective is to find the plane with the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.

https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
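The decision function and the geometric margin can be sketched with a hypothetical hyperplane (the weights `w` and bias `b` here are made up; a real SVM learns them from the training data):

```python
import numpy as np

# Hypothetical linear decision boundary: w·x + b = 0, i.e. the line x1 = 1
w = np.array([2.0, 0.0])
b = -2.0

def decide(x):
    """The sign of w·x + b gives the predicted side of the hyperplane."""
    return 1 if w @ x + b >= 0 else -1

def margin(x):
    """Geometric distance of a point from the hyperplane: |w·x + b| / ||w||."""
    return abs(w @ x + b) / np.linalg.norm(w)

print(decide(np.array([2.0, 0.0])))   # point on the positive side
print(margin(np.array([2.0, 0.0])))   # distance 1.0 from the boundary
```

Training an SVM amounts to choosing `w` and `b` so that the smallest such margin over the training points is as large as possible.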

In [53]:
from sklearn.svm import SVC

SVCclassifier = SVC(kernel='linear', max_iter=251)
SVCclassifier.fit(x_train, y_train)

y_pred = SVCclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

SVCAcc = accuracy_score(y_pred,y_test)
print('SVC accuracy is: {:.2f}%'.format(SVCAcc*100))
              precision    recall  f1-score   support

       DrugY       1.00      0.68      0.81        34
       drugA       0.71      1.00      0.83         5
       drugB       0.75      1.00      0.86         3
       drugC       0.50      1.00      0.67         4
       drugX       0.83      1.00      0.91        20

    accuracy                           0.83        66
   macro avg       0.76      0.94      0.81        66
weighted avg       0.89      0.83      0.83        66

[[23  2  1  4  4]
 [ 0  5  0  0  0]
 [ 0  0  3  0  0]
 [ 0  0  0  4  0]
 [ 0  0  0  0 20]]
SVC accuracy is: 83.33%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py:301: ConvergenceWarning:

Solver terminated early (max_iter=251).  Consider pre-processing your data with StandardScaler or MinMaxScaler.

Naive Bayes¶

Naive Bayes is a probabilistic classifier inspired by Bayes' theorem, under the simple assumption that the attributes are conditionally independent.

Classification is conducted by deriving the maximum posterior, i.e. the maximal P(Ci|X), with the above assumption applied to Bayes' theorem. The assumption greatly reduces the computational cost, since only the class distributions need to be counted. Even though the assumption is not valid in most cases (the attributes are usually dependent), Naive Bayes performs surprisingly well.

Naive Bayes is very simple to implement, and good results are obtained in most cases. It scales easily to larger datasets since it takes linear time, rather than the expensive iterative approximation used by many other types of classifiers.

Naive Bayes can suffer from the zero-probability problem: when the conditional probability is zero for a particular attribute, the classifier fails to give a valid prediction. This has to be fixed explicitly using a Laplacian estimator.

https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
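The Laplacian estimator mentioned above can be sketched in a few lines. The counts are hypothetical; scikit-learn's `CategoricalNB` applies this kind of smoothing via its `alpha` parameter:

```python
def laplace_prob(count, class_total, n_categories, alpha=1):
    """P(attribute value | class) with add-alpha (Laplacian) smoothing,
    so an unseen value gets a small non-zero probability."""
    return (count + alpha) / (class_total + alpha * n_categories)

# Hypothetical counts: 0 of 10 patients in some class had BP == 'NORMAL'.
# Unsmoothed, P = 0/10 = 0 and the whole posterior product collapses to zero.
p_smoothed = laplace_prob(0, 10, 3)   # 3 BP categories: HIGH / LOW / NORMAL
print(p_smoothed)                     # 1/13, small but non-zero
```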

1 Categorical Naive Bayes¶

In [54]:
from sklearn.naive_bayes import CategoricalNB

NBclassifier1 = CategoricalNB()
NBclassifier1.fit(x_train, y_train)

y_pred = NBclassifier1.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

NBAcc1 = accuracy_score(y_pred,y_test)
print('Naive Bayes accuracy is: {:.2f}%'.format(NBAcc1*100))
              precision    recall  f1-score   support

       DrugY       1.00      0.68      0.81        34
       drugA       0.62      1.00      0.77         5
       drugB       0.75      1.00      0.86         3
       drugC       0.57      1.00      0.73         4
       drugX       0.83      1.00      0.91        20

    accuracy                           0.83        66
   macro avg       0.76      0.94      0.81        66
weighted avg       0.88      0.83      0.83        66

[[23  3  1  3  4]
 [ 0  5  0  0  0]
 [ 0  0  3  0  0]
 [ 0  0  0  4  0]
 [ 0  0  0  0 20]]
Naive Bayes accuracy is: 83.33%

2 Gaussian Naive Bayes¶

In [55]:
from sklearn.naive_bayes import GaussianNB

NBclassifier2 = GaussianNB()
NBclassifier2.fit(x_train, y_train)

y_pred = NBclassifier2.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

NBAcc2 = accuracy_score(y_pred,y_test)
print('Gaussian Naive Bayes accuracy is: {:.2f}%'.format(NBAcc2*100))
              precision    recall  f1-score   support

       DrugY       0.63      0.97      0.77        34
       drugA       1.00      0.20      0.33         5
       drugB       0.75      1.00      0.86         3
       drugC       1.00      0.50      0.67         4
       drugX       1.00      0.35      0.52        20

    accuracy                           0.70        66
   macro avg       0.88      0.60      0.63        66
weighted avg       0.80      0.70      0.66        66

[[33  0  1  0  0]
 [ 4  1  0  0  0]
 [ 0  0  3  0  0]
 [ 2  0  0  2  0]
 [13  0  0  0  7]]
Gaussian Naive Bayes accuracy is: 69.70%

Decision Tree¶

A decision tree builds classification or regression models in the form of a tree structure. It uses an if-then rule set which is mutually exclusive and exhaustive for classification. The rules are learned sequentially from the training data, one at a time. Each time a rule is learned, the tuples covered by it are removed. The process continues on the training set until a termination condition is met.

The tree is constructed in a top-down, recursive, divide-and-conquer manner. All attributes should be categorical; otherwise they should be discretized in advance. Attributes at the top of the tree have more impact on the classification, and they are identified using the concept of information gain.

A decision tree can easily be over-fitted, generating too many branches that may reflect anomalies due to noise or outliers. An over-fitted model performs very poorly on unseen data even though it performs impressively on the training data. This can be avoided by pre-pruning, which halts tree construction early, or by post-pruning, which removes branches from the fully grown tree.

https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
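Information gain can be sketched as the reduction in entropy achieved by a split (the labels and the candidate split below are hypothetical, for illustration only):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, splits):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(labels)
    child = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(labels) - child

# Hypothetical node: a split like 'Na_to_K > 15' separates the classes perfectly
parent = ["DrugY"] * 4 + ["drugX"] * 4
gain = information_gain(parent, [["DrugY"] * 4, ["drugX"] * 4])
print(gain)  # 1.0 bit: a perfect binary split removes all uncertainty
```

At each node the tree-growing algorithm picks the attribute whose split yields the highest such gain.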

In [56]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

DTclassifier = DecisionTreeClassifier(criterion='gini', max_depth=10, random_state=0)
DTclassifier = DTclassifier.fit(x_train, y_train)

y_pred = DTclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

DTAcc = accuracy_score(y_pred,y_test)
print('Decision Tree accuracy is: {:.2f}%'.format(DTAcc*100))
              precision    recall  f1-score   support

       DrugY       0.76      0.65      0.70        34
       drugA       0.44      0.80      0.57         5
       drugB       0.75      1.00      0.86         3
       drugC       0.50      1.00      0.67         4
       drugX       0.88      0.70      0.78        20

    accuracy                           0.71        66
   macro avg       0.67      0.83      0.71        66
weighted avg       0.75      0.71      0.72        66

[[22  5  1  4  2]
 [ 1  4  0  0  0]
 [ 0  0  3  0  0]
 [ 0  0  0  4  0]
 [ 6  0  0  0 14]]
Decision Tree accuracy is: 71.21%
In [57]:
#fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (20,20), dpi=600)
#tree.plot_tree(DTclassifier, max_depth = 10, feature_names = X.columns, filled=True)
#plt.show()

Random Forest¶

Random forest is a commonly used machine learning algorithm that combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.

https://www.ibm.com/cloud/learn/random-forest

In [58]:
from sklearn.ensemble import RandomForestClassifier

RFclassifier = RandomForestClassifier(max_leaf_nodes=30)
RFclassifier.fit(x_train, y_train)

y_pred = RFclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

RFAcc = accuracy_score(y_pred,y_test)
print('Random Forest accuracy is: {:.2f}%'.format(RFAcc*100))
              precision    recall  f1-score   support

       DrugY       1.00      0.59      0.74        34
       drugA       0.56      1.00      0.71         5
       drugB       0.60      1.00      0.75         3
       drugC       0.50      1.00      0.67         4
       drugX       0.83      1.00      0.91        20

    accuracy                           0.79        66
   macro avg       0.70      0.92      0.76        66
weighted avg       0.87      0.79      0.79        66

[[20  4  2  4  4]
 [ 0  5  0  0  0]
 [ 0  0  3  0  0]
 [ 0  0  0  4  0]
 [ 0  0  0  0 20]]
Random Forest accuracy is: 78.79%

Model Comparison¶

In [59]:
compare = pd.DataFrame({'Model': ['Logistic Regression', 'K Neighbors', 'SVM', 'Categorical NB', 'Gaussian NB', 'Decision Tree', 'Random Forest'], 
                        'Accuracy': [LRAcc*100, KNAcc*100, SVCAcc*100, NBAcc1*100, NBAcc2*100, DTAcc*100, RFAcc*100]})
compare.sort_values(by='Accuracy', ascending=False)
Out[59]:
Model Accuracy
0 Logistic Regression 86.363636
2 SVM 83.333333
3 Categorical NB 83.333333
6 Random Forest 78.787879
1 K Neighbors 77.272727
5 Decision Tree 71.212121
4 Gaussian NB 69.696970
In [69]:
sns.set_theme(style="darkgrid")
sns.set(rc={"figure.figsize":(12, 6)})
sns.barplot(data=compare.sort_values(by='Accuracy', ascending=False), x='Model', y='Accuracy', palette="mako_r")
plt.ylabel('Accuracy Percentage')
plt.xlabel('Model')
plt.title('Model Accuracy')
plt.show()